Pbm: A new dataset for blog mining
Abstract
Text mining is becoming vital as Web 2.0 enables collaborative content creation and sharing, and researchers are increasingly interested in text mining methods for discovering knowledge. Text mining researchers come from a variety of areas, including natural language processing, computational linguistics, machine learning, and statistics. A typical text mining application involves preprocessing the text, stemming and lemmatization, tagging and annotation, deriving knowledge patterns, and evaluating and interpreting the results. Text mining covers numerous tasks, such as clustering, categorization, sentiment analysis, and summarization. There is a growing need to standardize the evaluation of these tasks, and one major component of such standardization is providing standard datasets. Although various standard datasets are available for traditional text mining tasks, datasets for blog mining are few and expensive. Blogs, a new genre in Web 2.0, are the digital diaries of web users; they have chronological entries and contain a great deal of useful knowledge, and thus offer many challenges and opportunities for text mining. In this paper, we report a new indigenous dataset for the Pakistani political blogosphere. The paper describes the process of data collection, organization, and standardization. We have used this dataset to carry out various text mining tasks for the blogosphere, such as blog search, political sentiment analysis and tracking, identification of influential bloggers, and clustering of blog posts. We wish to offer this dataset free of charge to others who aspire to pursue further work in this domain.
Similar resources
A new stochastic 3D seismic inversion using direct sequential simulation and co-simulation in a genetic algorithm framework
Stochastic seismic inversion is a family of inversion algorithms in which the inverse solution is carried out using geostatistical simulation. In this work, a new 3D stochastic seismic inversion was developed in the MATLAB programming software. The proposed inversion algorithm is an iterative procedure that uses the principle of cross-over genetic algorithms as the global optimization techniqu...
MINING FUZZY TEMPORAL ITEMSETS WITHIN VARIOUS TIME INTERVALS IN QUANTITATIVE DATASETS
This research aims at proposing a new method for discovering frequent temporal itemsets in continuous subsets of a dataset with quantitative transactions. It is important to note that although these temporal itemsets may have relatively high support or occurrence within particular time intervals, they do not necessarily get similar support across the whole dataset, which makes i...
Minimizing the Repeated Database Scan Using an Efficient Frequent Pattern Mining Algorithm in Web Usage Mining
Data mining is the process of discovering new patterns and knowledge from large datasets. Web mining is the application of data mining techniques to extract and mine useful knowledge and interesting patterns from the World Wide Web. Web data includes web documents, hyperlinks between documents, and usage logs of web sites. The web usage data captures the identity and origin of the web user along thei...
Understanding Travel Destinations From Structured Tourism Blogs
The increasing popularity of tourist generated content has created abundant opportunities for people to understand the opinions and experiences of prior tourists. However, till now no framework has been presented to automatically discover useful patterns from structured tourism blogs. In this paper, we present a method to mine the tourism information such as frequented spots and popular travel ...
Mining Access Patterns Using Clustering
Web usage mining is an application of data mining techniques to discover usage patterns from web data, in order to understand and better serve the needs of web-based applications. The aim of this paper is to discuss a proposed system that would perform clustering of user sessions extracted from the web logs. HTML links are extracted from these web logs for each user which constitutes the d...
Journal title:
- CoRR
Volume abs/1201.2073
Publication date 2011